Analysis module 1 (strict): Additional filtering steps for breastmilk samples RoVI study — 21 July, 2022

Stringent filtering - breastmilk

Summary

  • See html outputs of module 1 for full details
  • Filtering of all reads performed based on size, taxonomy and presence in ≥2 samples
  • 23 RSVs removed from BM samples by abundance-based filtering using the decontam package
  • Overall, 50,579 RSVs across BM samples in ps1; 7,386 retained in ps7 (after filtering), compared with 4,662 in infant stools and 4,423 in maternal stools
  • In breastmilk sequencing runs, consistent composition seen in positive controls based on abundance-weighted metrics (Shannon/weighted Bray-Curtis), but less consistent based on unweighted metrics (richness/unweighted Bray-Curtis)
  • High reproducibility in 10% validation subset (R-squared >0.9 for alpha diversity, beta diversity, and genus abundances)

Composition of breastmilk samples and extraction controls

      
Taxa are displayed if they had a mean relative abundance of ≥2% in BM samples at ≥1 timepoint (BM1, BM2, or BM3) in ≥1 country. Taxonomic composition of extraction controls clearly differs from that of breastmilk samples, with reduced Streptococcus abundance and increased abundance of rare taxa (grouped as ‘Other’ in the plot above).

Sample distribution

      
       India Malawi  UK
  BM1    324    129  91
  BM2    291     91  40
  BM3    288     85  41
  WCBM    39      1  16

Count distribution in controls


 0-100 100-1k 1k-10k   >10k 
     0      0      0     56 

Beta diversity before stringent filtering

PCoA plots of Bray-Curtis distances support the view that the majority of extraction controls cluster separately from breastmilk samples.

Pre-filtering

Summary
  • n samples: 1380
  • n controls: 56
  • n features: 7528
  • n reads: 1.4330148 × 10^8

Strict filter 1: retain taxa present at ≥0.1% abundance in ≥1% of samples from ≥1 country

Summary
  • n samples: 1380
  • n controls: 56
  • n features: 1800 (23.9%)
  • n features removed: 5728
  • n reads: 1.3234547 × 10^8 (92.4%)
  • Overall, 8% of breastmilk reads removed, but 76% of taxa filtered.
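
The rule applied in strict filter 1 can be sketched as below. This is a minimal Python illustration, not the code used to produce this report (which is not shown here); the function name `taxa_passing_filter`, the dictionary layout, and the toy counts are all assumptions made for the example.

```python
def taxa_passing_filter(counts, country, min_abund=0.001, min_prev=0.01):
    """Return taxa at >= min_abund relative abundance in >= min_prev of
    samples from at least one country.

    counts:  {sample_id: {taxon: read_count}}   (hypothetical layout)
    country: {sample_id: country_label}
    """
    # Group sample IDs by country.
    by_country = {}
    for sample_id, label in country.items():
        by_country.setdefault(label, []).append(sample_id)

    keep = set()
    for sample_ids in by_country.values():
        hits = {}  # taxon -> number of samples passing the abundance cut-off
        for sample_id in sample_ids:
            total = sum(counts[sample_id].values())
            if total == 0:
                continue
            for taxon, n in counts[sample_id].items():
                if n / total >= min_abund:
                    hits[taxon] = hits.get(taxon, 0) + 1
        # Keep the taxon if it is prevalent enough in this country.
        for taxon, n_hits in hits.items():
            if n_hits / len(sample_ids) >= min_prev:
                keep.add(taxon)
    return keep


# Toy data (invented): two samples per country, three taxa.
counts = {
    "s1": {"A": 9980, "B": 20, "C": 0},
    "s2": {"A": 10000, "B": 0, "C": 0},
    "s3": {"A": 9995, "B": 0, "C": 5},
    "s4": {"A": 10000, "B": 0, "C": 0},
}
country = {"s1": "IND", "s2": "IND", "s3": "UK", "s4": "UK"}
kept = taxa_passing_filter(counts, country)
print(sorted(kept))
```

In the toy data, taxon C never reaches 0.1% of reads in any sample, so it is dropped, while B is kept because it passes in one country even though it is absent from the other.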

Strict filter 2: remove taxa more prevalent in extraction controls

Taxa are displayed if they occurred with greater prevalence in breastmilk extraction controls than in samples (Fisher’s p<0.05, as determined using the decontam package).

Summary
  • n samples: 1380
  • n controls: 56
  • n features: 1718 (22.8%)
  • n features removed: 82
  • n reads: 1.2611155 × 10^8 (88%)
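
The prevalence comparison behind strict filter 2 can be illustrated with a textbook one-sided Fisher's exact test. The sketch below does not reproduce decontam's internals; it only shows the kind of test implied by "Fisher's p<0.05". The function name `fisher_greater` and the per-taxon counts are hypothetical; only the totals of 56 controls and 1380 samples come from the summaries above.

```python
from math import comb


def fisher_greater(ctrl_present, ctrl_total, samp_present, samp_total):
    """One-sided Fisher's exact p value that a taxon is present in a larger
    share of controls than of samples (hypergeometric upper tail)."""
    n_total = ctrl_total + samp_total
    n_present = ctrl_present + samp_present
    denom = comb(n_total, ctrl_total)
    upper = min(ctrl_total, n_present)
    # Sum the probabilities of tables at least as extreme as the observed one.
    tail = sum(
        comb(n_present, k) * comb(n_total - n_present, ctrl_total - k)
        for k in range(ctrl_present, upper + 1)
    )
    return tail / denom


# Hypothetical taxon seen in 20/56 controls but only 30/1380 samples.
p_contaminant = fisher_greater(20, 56, 30, 1380)
# Hypothetical taxon seen in 2/56 controls and 600/1380 samples.
p_genuine = fisher_greater(2, 56, 600, 1380)
print(p_contaminant < 0.05, p_genuine < 0.05)
```

Under these assumed counts, the first taxon would be flagged as more prevalent in controls and the second would not.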

Strict filter 3: remove samples that cluster with extraction controls

                     
                      IND MLW  UK
  BM1                 324 129  91
  BM2                 291  91  40
  BM3                 288  85  41
  extraction controls  39   1  16

For each sample, the mean distance is calculated from (i) other breastmilk samples from the same country and (ii) all breastmilk extraction controls, and the ratio of (i) to (ii) is taken. If the sample clusters with other samples, this ratio will be <1; if it clusters with extraction controls, the ratio will be >1. Overall, 144/1436 (10%) samples had a ratio of >1 based on either weighted or unweighted Bray-Curtis distances and were therefore categorised as potentially contaminated. The corresponding figure for negative extraction controls is rendered as 0/0 (NaN%) in this output, so no value is available here.
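
The distance-ratio rule can be sketched as follows. This is an illustrative Python version under stated assumptions: "unweighted" Bray-Curtis is taken to mean Bray-Curtis on presence/absence data, the function names are hypothetical, and the count vectors are invented toy data rather than study samples.

```python
def bray_curtis(u, v, binary=False):
    """Bray-Curtis dissimilarity between two count vectors.
    binary=True converts counts to presence/absence first (unweighted)."""
    if binary:
        u = [1 if x > 0 else 0 for x in u]
        v = [1 if x > 0 else 0 for x in v]
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_ratio(sample, peers, controls, binary=False):
    """Mean distance to same-country samples divided by mean distance to
    extraction controls; > 1 means the sample sits closer to the controls."""
    d_peers = sum(bray_curtis(sample, p, binary) for p in peers) / len(peers)
    d_ctrls = sum(bray_curtis(sample, c, binary) for c in controls) / len(controls)
    if d_ctrls == 0:
        return float("inf")  # identical to every control
    return d_peers / d_ctrls


def flagged(sample, peers, controls):
    """Flag if either the weighted or the unweighted ratio exceeds 1."""
    return (distance_ratio(sample, peers, controls) > 1
            or distance_ratio(sample, peers, controls, binary=True) > 1)


# Toy count vectors over four taxa (invented).
peers = [[80, 15, 5, 0], [70, 25, 5, 0]]
controls = [[5, 5, 10, 80], [0, 10, 10, 80]]
typical = [75, 20, 5, 0]
suspect = [10, 5, 5, 80]
print(flagged(typical, peers, controls), flagged(suspect, peers, controls))
```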

Nanodrop profile of contaminant vs non-contaminant samples

       
        IND MLW  UK
  FALSE 835 285 172
  TRUE  107  21  16

Nanodrop profile of samples over time

Wilcoxon p values for contaminant vs non-contaminant samples
  • India: <0.001
  • Malawi: 0.138
  • UK: <0.001

Summary
  • n samples: 1276
  • n controls: 56
  • n features: 1718 (22.8%)
  • n features removed: 0
  • n samples removed: 104
  • n reads: 1.2611155 × 10^8 (88%)
  • Samples clustering with extraction controls had significantly lower input DNA concentrations in India and the UK; the difference was not significant in Malawi (p = 0.138).

Merge with stool samples

Summary: pre stringent filtering (breastmilk, stool, controls)
  • Number of samples: 4662
  • Number of taxa: 10159
Summary: post stringent filtering (breastmilk, stool, controls)
  • Number of samples: 4558 (number removed: 104)
  • Number of taxa: 7439 (number removed: 2720)

Original filtering step 6: remove duplicates (including validation samples sequenced in London) - ps6

  • n samples: 3695
  • n controls: 340
  • n features: 7343

Select rarefaction depth

Lines display depths of 5,000, 15,000 and 50,000 sequences. Abbreviations: BM, breastmilk; BS, baby stool; MS, maternal stool.

      n  d5k d15k d50k
BM 1224 1201 1149  873
BS 2025 1984 1977 1927
MS  446  443  441  429

A rarefaction depth of 15,000 sequences per sample retains 97.6% of infant samples, 98.9% of maternal samples, and 93.9% of breastmilk samples. Note: a rarefaction depth of 25,000 is used for analyses focusing specifically on stool samples.
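
The retention percentages can be checked directly from the depth table above. A short Python sketch follows; the dictionary layout and the helper name `pct_retained` are illustrative only, and the counts are copied from the table.

```python
# Samples reaching each depth, copied from the rarefaction table.
retained = {
    "BM": {"n": 1224, "d5k": 1201, "d15k": 1149, "d50k": 873},
    "BS": {"n": 2025, "d5k": 1984, "d15k": 1977, "d50k": 1927},
    "MS": {"n": 446, "d5k": 443, "d15k": 441, "d50k": 429},
}


def pct_retained(sample_type, depth):
    """Percentage of samples of this type with at least `depth` sequences."""
    row = retained[sample_type]
    return round(100 * row[depth] / row["n"], 1)


for sample_type in retained:
    print(sample_type, pct_retained(sample_type, "d15k"))
```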

Original filtering step 7: remove samples with <15,000 sequences - ps7

  • n samples: 3567
  • n controls: 222
  • n features: 7169

Original filtering step 8: rarefy to 15,000 sequences - ps8

  • n samples: 3567
  • n controls: 222
  • n features: 7032
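
Rarefying to a fixed depth amounts to drawing reads at random without replacement. The sketch below is a generic Python illustration, not the routine used for ps8; the function name `rarefy`, the seed, and the toy profile are assumptions.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a {taxon: read_count} profile to `depth` reads without
    replacement. Returns None if the sample has fewer than `depth` reads."""
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # One entry per read, then draw `depth` of them without replacement.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    drawn = rng.sample(pool, depth)
    out = {}
    for taxon in drawn:
        out[taxon] = out.get(taxon, 0) + 1
    return out


# Toy profile (invented) with 20,000 reads across three taxa.
profile = {"taxon_a": 12000, "taxon_b": 6000, "taxon_c": 2000}
rarefied = rarefy(profile, 15000)
print(sum(rarefied.values()))
```

Taxa represented by only a few reads may not be drawn at all, which is consistent with the feature count falling from 7169 in ps7 to 7032 in ps8.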

Filtering statistics

                        n_samples n_taxa total_count   min   mean     sd
ps1 (unfiltered)             4680  81278   599439742     2 128085 177663
ps2 (length)                 4680  69094   596953865     2 127554 177552
ps3 (taxonomy)               4680  66231   595091220     2 127156 177351
ps4 (≥0.1% in >1)            4673  10159   581503484     2 124439 174940
ps5 (strict decontam)        4558   7439   554538470     2 121663 171108
ps6 (no duplicates)          4035   7343   514393687     2 127483 180008
ps7 (samples with ≥15k)      3789   7169   513573008 15477 135543 182866
ps8 (rarefied to 15k)        3789   7032    56835000 15000  15000      0

Statistics by sample type

                     nsamples ntaxa total_count min     av     sd
ps1 (infant stool)       1856 10568   269774889   4 145353 142163
ps1 (maternal stool)      466 21698    62211349  78 133501 107906
ps1 (breastmilk)         1334 50579   146574829  13 109876 147375

Boxplots of filtering statistics by sample type

Filtering retention by sample type - ps5 (decontam strict)

Mean retention % after taxon filtering (ps5):
* Infant stools = 99.9
* Maternal stools = 96.8
* Breastmilk = 85.2

Statistics by sample type - filtered and deduplicated

                     nsamples ntaxa total_count   min     av     sd
ps7 (infant stool)       1704  4677   250977560 16522 147287 145540
ps7 (maternal stool)      441  4423    57620735 23681 130659 105046
ps7 (breastmilk)         1149  1716   109534914 15477  95331 112958

Statistics by sample type - all faecal samples

                nsamples ntaxa total_count   min     av     sd
ps7 (all stool)     2145  6423   308598295 16522 143869 138328

Statistics by sample type - rarefied

                     nsamples ntaxa total_count min_count av_count sd_count
ps8 (infant stool)       1704  4372    25560000     15000    15000        0
ps8 (maternal stool)      441  4332     6615000     15000    15000        0
ps8 (breastmilk)         1149  1715    17235000     15000    15000        0

Summary of filtered taxa

Ten most frequent taxonomic assignments displayed for each group. Remaining taxa grouped as ‘other’. Bar heights represent the proportion of RSVs assigned to each taxon, independent of their relative abundance.

Infant stool

Maternal stool

Breastmilk

Run-to-run variation

Breastmilk

Alpha and beta diversity in positive controls

Sample profile


  BMctrl MCctrlBM 
      17       15 

Variation explained by run for each sample type

Sample profile

  country     sample_type n_samples
1     IND  week of life 1       274
2     IND  week of life 7       247
3     IND week of life 11       232
4      UK  week of life 1        61
5      UK  week of life 7        39
6      UK week of life 11        41
7     MLW  week of life 1        95
8     MLW  week of life 7        81
9     MLW week of life 11        79

Sample subsets in which the PERMANOVA p value was <0.05 for either the weighted (w) or unweighted (u) analysis

  country     sample_type  R2_w   p_w  R2_u   p_u n_runs n_samples
1     IND  week of life 1 0.015 0.032 0.022 0.001      4       274
2     IND  week of life 7 0.023 0.001 0.030 0.001      4       247
3     IND week of life 11 0.026 0.001 0.036 0.001      4       232
7     MLW  week of life 1 0.028 0.073 0.033 0.002      3        95
8     MLW  week of life 7 0.039 0.022 0.049 0.001      3        81
9     MLW week of life 11 0.038 0.028 0.044 0.001      3        79
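
For orientation, the R2 and p values in a table of this kind can be obtained from a distance matrix with a single-factor PERMANOVA. The sketch below is a generic pure-Python version of that calculation (sums of squared distances partitioned by group, with a label-permutation p value). It is an assumption that the report's values were computed this way, and the function name, seed, and toy one-dimensional data are invented.

```python
import random


def permanova(dist, groups, n_perm=999, seed=1):
    """Single-factor PERMANOVA on a square distance matrix.
    Returns (R2, permutation p value) for the grouping in `groups`."""
    n = len(groups)
    n_groups = len(set(groups))
    # Total sum of squares from all pairwise distances (label-independent).
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n

    def pseudo_f(labels):
        ss_within = 0.0
        for g in set(labels):
            idx = [i for i in range(n) if labels[i] == g]
            ss_within += sum(dist[i][j] ** 2
                             for a, i in enumerate(idx)
                             for j in idx[a + 1:]) / len(idx)
        ss_among = ss_total - ss_within
        f = (ss_among / (n_groups - 1)) / (ss_within / (n - n_groups))
        return f, ss_among / ss_total

    f_obs, r2 = pseudo_f(groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    n_ge = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute run labels across samples
        if pseudo_f(shuffled)[0] >= f_obs:
            n_ge += 1
    return r2, (n_ge + 1) / (n_perm + 1)


# Toy data (invented): 12 samples on a line, two well-separated "runs".
points = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
runs = ["run1"] * 6 + ["run2"] * 6
dist = [[abs(a - b) for b in points] for a in points]
r2, p = permanova(dist, runs)
print(round(r2, 3), p)
```

The toy runs are deliberately far apart, so R2 is large here; the values in the table above are small by comparison (R2 of roughly 0.02 to 0.05).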

Supplementary figure

10% validation

Breastmilk

Input data

ps5_validation
  • biom table containing paired Liverpool/London breastmilk samples in which both have at least 15,000 reads
  • nsamples = 160
  • ntaxa = 1352

See outputs of analysis module 1 for further details of feature table filtering process.

Alpha diversity (rarefied to 15,000 sequences per sample)

Sample profile

            
             IND MLW UK
  breastmilk  26  24 30

Beta diversity plots

Supplementary Figure

Correlation of genus abundances for top-20 genera